06_analysis_2

Author

group22

Load Libraries

library("tidyverse")
library("ggridges")
library("RColorBrewer")

Load Data

Loading the data from the .csv file that was downloaded.

df <- read_csv("../_raw/project_data.csv")

Save loaded data

Saving the dataset as a .tsv file in the data directory, then reading the saved file back to make sure it works.

write_tsv(df, "../data/01_dat_load.tsv")

df <- read_tsv("../data/01_dat_load.tsv")
library("tidyverse")

Checking for missing values

In the first part of data cleaning, we focus on identifying and removing any missing values in our dataset. Missing data can significantly impact the quality of our analysis, leading to biased or inaccurate results.

df <- read_tsv("../data/01_dat_load.tsv")

df_cleaned <- df |> 
  drop_na()

rows_removed <- nrow(df) - nrow(df_cleaned)
cat("Rows removed because of missing values:", rows_removed, "\n")
Rows removed because of missing values: 0 

From the first part we can see that the dataset did not have any missing values. If there had been any, we would have removed the rows containing them.

Checking for right type and values of each variable

In the second part we want to ensure that each of the variables in the dataset has the expected type. By doing this we can make sure that we don't have any faulty variables and that we understand our variables for further analysis.

column_types <- df_cleaned |> 
  summarise(across(everything(), class))

print(column_types)
# A tibble: 1 × 22
  Diabetes_binary HighBP  HighChol CholCheck BMI     Smoker  Stroke 
  <chr>           <chr>   <chr>    <chr>     <chr>   <chr>   <chr>  
1 numeric         numeric numeric  numeric   numeric numeric numeric
# ℹ 15 more variables: HeartDiseaseorAttack <chr>, PhysActivity <chr>,
#   Fruits <chr>, Veggies <chr>, HvyAlcoholConsump <chr>, AnyHealthcare <chr>,
#   NoDocbcCost <chr>, GenHlth <chr>, MentHlth <chr>, PhysHlth <chr>,
#   DiffWalk <chr>, Sex <chr>, Age <chr>, Education <chr>, Income <chr>

We can see that all of our variables are numeric.

Filtering incorrect values

In the third part of cleaning we want to filter out incorrect values. Along with the dataset we received the valid range of values for each numeric variable, so this part is about making sure that every value falls into its correct range.

original_df <- df_cleaned

df_cleaned <- df_cleaned |>
  filter(Diabetes_binary %in% c(0, 1),
         HighBP %in% c(0, 1),
         HighChol %in% c(0, 1),
         CholCheck %in% c(0, 1),
         BMI > 0,
         Smoker %in% c(0, 1),
         Stroke %in% c(0, 1),
         HeartDiseaseorAttack %in% c(0, 1),
         PhysActivity %in% c(0, 1),
         Fruits %in% c(0, 1),
         Veggies %in% c(0, 1),
         HvyAlcoholConsump %in% c(0, 1),
         AnyHealthcare %in% c(0, 1),
         NoDocbcCost %in% c(0, 1),
         GenHlth %in% 1:5,
         MentHlth %in% 0:30,
         PhysHlth %in% 0:30,
         DiffWalk %in% c(0, 1),
         Sex %in% c(0, 1),
         Age %in% 1:13,
         Education %in% 1:6,
         Income %in% 1:8)

rows_removed_after_filter <- nrow(original_df) - nrow(df_cleaned)
cat("Rows removed because of incorrect values:", rows_removed_after_filter, "\n")
Rows removed because of incorrect values: 0 

From this part we can see that the dataset did not have any out-of-range values for any of the variables. If there had been any incorrect values, we would have removed the rows containing them.
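The per-column range checks above could also be driven by a small lookup table, which keeps the valid ranges in one place. A minimal sketch; the `valid_values` list and the toy tibble below are made up for illustration and are not part of the pipeline:

```r
library(tidyverse)

# Hypothetical lookup of valid values per column (names must match the data)
valid_values <- list(
  HighBP  = c(0, 1),
  GenHlth = 1:5,
  Age     = 1:13
)

# Toy data standing in for df_cleaned
toy <- tibble(HighBP = c(0, 1, 2), GenHlth = c(1, 5, 3), Age = c(4, 9, 13))

# Keep only rows where every listed column falls inside its valid set
toy_ok <- reduce(names(valid_values),
                 \(d, col) filter(d, .data[[col]] %in% valid_values[[col]]),
                 .init = toy)

nrow(toy_ok)  # 2: the row with HighBP == 2 is dropped
```

Adding or changing a rule then only requires editing `valid_values`, not the filter chain itself.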

Writing the file

write_tsv(df_cleaned, "../data/02_dat_clean.tsv")
library("tidyverse")

Load Data

df_cleaned <- read_tsv("../data/02_dat_clean.tsv")

Adding new variables

Smoking

Changing the Smoker variable from binary format to character: the value "0", which indicates non-smokers, is changed to "Non-Smoker", and the value "1" to "Smoker".

df_aug <- df_cleaned |> 
  mutate(Smoking_Status = case_when(
    Smoker == 0 ~ "Non-Smoker",
    Smoker == 1 ~ "Smoker"
  ))

Diabetes

Converting the Diabetes_binary variable from binary to character.

df_aug <-
  df_aug |> 
  mutate(Diabetes_Status = case_when(
    Diabetes_binary == 0 ~ "Non-Diabetic",
    Diabetes_binary == 1 ~ "Diabetic"))

Gender

Changing the Sex variable from binary to character.

df_aug <-
  df_aug |> 
  mutate(Sex_character = case_when(
    Sex == 0 ~ "Female",
    Sex == 1 ~ "Male"))

Age

The age was changed from the 13-level age category to the corresponding range values.

df_aug <-
  df_aug |> 
  mutate(Age_Range = case_when(
    Age == 1 ~ "18-24",
    Age == 2 | Age == 3 ~ "25-34",
    Age == 4 | Age == 5 ~ "35-44",
    Age == 6 | Age == 7 | Age == 8 ~ "45-59",
    Age == 9 | Age == 10 | Age == 11 ~ "60-74",
    Age == 12 | Age == 13 ~ "75-"))

Income

Income variable changed from a 1-8 scale to three classes: Poor, Average and Wealthy.

df_aug <-
  df_aug |> 
  mutate(Income_Class = case_when(
    Income <= 3 ~ "Poor",
    Income  > 3 & Income <= 6 ~ "Average",
    Income > 6 ~ "Wealthy"))

Physical Activity

This transforms the PhysActivity variable from binary to character.

df_aug <-
  df_aug |> 
  mutate(Physically_Active = case_when(
    PhysActivity == 0 ~ "No",
    PhysActivity == 1 ~ "Yes"))

Habits

One variable that should be added is Habits, a character variable that describes whether the lifestyle of the individual is healthy or not. It depends on five variables: smoking, alcohol consumption, fruit consumption, vegetable consumption and physical activity. The first two contribute -1 point each, while the latter three contribute +1 point each. A high score indicates a healthy lifestyle, a medium one indicates an average lifestyle, and a low score indicates an unhealthy one.

df_aug <-
  df_aug |> 
  mutate(Habit_Score = Veggies + Fruits + PhysActivity - Smoker - HvyAlcoholConsump) |> 
  mutate(Habits = case_when(
    Habit_Score < 0 ~ "Unhealthy",
    Habit_Score >= 0 & Habit_Score < 2 ~ "Average",
    Habit_Score >= 2 ~ "Healthy"),
    .keep = "unused")

Health risk

The health risk depends on the prevalence of a heart attack/disease, stroke, high BP and high cholesterol. This variable weights the heart disease and stroke variables more than the others.

df_aug <-
  df_aug |> 
  mutate(Risk_Score = HighBP + HighChol + 2*Stroke + 2*HeartDiseaseorAttack) |> 
  mutate(Health_Risk = case_when(
    Risk_Score < 2 ~ "Low Risk",
    Risk_Score >= 2 & Risk_Score < 4 ~ "Medium Risk",
    Risk_Score >= 4 ~ "High Risk"),
    .keep = "unused")

Socio-economical Class

At first we created a binary variable for educational background: 0 refers to individuals who did not attend school or did not graduate high school, while 1 refers to those who graduated high school and may have received higher education.

df_aug <-
  df_aug |> 
  mutate(Education_binary = case_when(
    Education < 4 ~ 0,
    Education >= 4 ~ 1)) |> 
  mutate(SE_Score = Income + AnyHealthcare + Education_binary + PhysActivity) |> 
  mutate(SE_Background = case_when(
    SE_Score < 6 ~ "Lower Class",
    SE_Score >= 6 ~ "Higher Class"),
    .keep = "unused")

Writing the file

write_tsv(df_aug, file = "../data/03_dat_aug.tsv")
library("tidyverse")
library("ggridges")
library("RColorBrewer")

Load data

df_aug <- read_tsv("../data/03_dat_aug.tsv")

Visualization

  1. The distribution of BMI in the different age groups, including a comparison between smokers and non-smokers.
plot_1 <- 
  df_aug |> 
  ggplot(aes(x = BMI, y = Age_Range)) +
  geom_density_ridges(aes(fill = Age_Range), alpha = 0.6) + 
  facet_wrap(~Smoking_Status) +
  labs(x = "BMI",
       y = "Age Group") +
  theme_minimal(base_size = 14) +
  theme(legend.position = "none",
        plot.background = element_rect(fill = "white", color="white")) +
  scale_fill_viridis_d(option = "rocket") +
  labs(title = "BMI distribution over the different age groups", 
       subtitle = "A comparison between smokers and non-smokers")

plot_1
Picking joint bandwidth of 0.972
Picking joint bandwidth of 1.18

As illustrated in the plot above, smoking status mostly affects the youngest age groups, namely 18-24 and 25-34, where smokers have a generally higher BMI than non-smokers. For the other age groups the behavior is almost the same between smokers and non-smokers, so smoking does not appear to be a main factor influencing BMI for individuals over 35. However, the general tendency is that BMI increases with age, as can be seen from the peaks.

  2. Comparison of BMI between income classes, divided into individuals with and without physical activity.
df_aug |> 
  ggplot(mapping = aes(x = BMI,
                       y = fct_relevel(Income_Class,"Poor","Average","Wealthy"))) +
  geom_boxplot(aes(fill = Physically_Active),
               outlier.shape = NA, alpha = 0.6) +
  xlim(0,60) +
  xlab("BMI") +
  ylab("Income Class") +
  theme(legend.position = "none") +
  scale_fill_manual(values = c("aquamarine", "blue")) +
  labs(title = "Comparison of BMI between different income classes", 
       subtitle = "Aquamarine: No Physical Activity, Blue: Physical Activity")
Warning: Removed 260 rows containing non-finite values (`stat_boxplot()`).

The graph displays a general decrease of BMI in physically active individuals compared to inactive ones. It is also evident that BMI gradually increases from the wealthiest to the poorest income group, regardless of physical activity status.

  3. Distribution of habits among individuals, compared between genders.
plot_2 <- 
  df_aug |> 
  ggplot(mapping = aes(x = fct_relevel(Habits,"Unhealthy","Average","Healthy"),
                       fill = Sex_character)) +
  geom_bar(position = position_dodge(), alpha = 0.8) +
  facet_wrap(~Diabetes_Status) +
  theme_minimal(base_size = 14) +
  scale_fill_manual(name = "Gender", values = c("midnightblue", "mediumorchid2")) +
  theme(legend.position = "right",
        plot.background = element_rect(fill = "white", color="white")) +
  ylab("Count") +
  xlab("Habits") +
  labs(title = "The distribution of habits among individuals", 
       subtitle = "Comparison between 2 genders")

plot_2

Among the individuals with healthy habits, in both diabetic and non-diabetic cases more women have healthier habits, whereas in the "Average" habits group the results are more mixed: while in the non-diabetic cases more women have an "Average" lifestyle, the opposite holds for their diabetic counterparts. It is also remarkable that the count of female individuals in the non-diabetic group doubles from the "Average" to the "Healthy" group, while the increase is slighter in the diabetic cases.

Regarding the male counterparts, among the diabetic cases there is no important change between the "Average" and "Healthy" groups, while in the non-diabetic cases the "Healthy" counts were nearly 2000 higher than the "Average" ones.

Saving Plots

plot_1 |> 
  ggsave(filename = "04_key_plot_1.png", path = "../results/")
Saving 7 x 5 in image
Picking joint bandwidth of 0.972

Picking joint bandwidth of 1.18
plot_2 |> 
  ggsave(filename = "04_key_plot_2.png", path = "../results/")
Saving 7 x 5 in image
library("tidyverse")
library("broom")

Load data

df <- read_tsv("../data/03_dat_aug.tsv")
Rows: 70692 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (9): Smoking_Status, Diabetes_Status, Sex_character, Age_Range, Income_...
dbl (23): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
table(df$Diabetes_binary)

    0     1 
35346 35346 

As can be seen, the dataset has an equal 50-50 split between respondents with no diabetes and respondents with either prediabetes or diabetes. The target variable Diabetes_binary has 2 classes: 0 for no diabetes, and 1 for prediabetes or diabetes.

This script is the first one related to the analysis part. The focus here is on interpreting the correlation coefficients to see what interactions exist between the variables and how they affect the variable related to the diagnosis of diabetes (Diabetes_binary).

The analysis will continue with the creation of a generalized linear model. The model will be fitted with all available variables and the results obtained will be interpreted. The script ends with the use of the step() method to find out which variables are best according to our data, together with a comparison between models in terms of AIC and the selected variables.

Correlation analysis

So, what is correlation? Correlation is a statistical measure of a relationship between two variables. Specifically, it refers to a linear relationship between the two variables. The correlation coefficient is the numerical measure of a statistical correlation and indicates the strength of the relationship.

One of the major restrictions of correlation is that it measures relationships only between numerical variables. To assess the relationship between categorical variables, performing a chi-square test would be a good option. In this case, however, a dataset containing only numerical variables is needed.

df_numeric <- df |>
  select_if(is.numeric)
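To illustrate the chi-square option mentioned above for categorical variables, here is a minimal sketch; the contingency table counts are invented for illustration and do not come from our dataset:

```r
# Illustrative counts only: a contingency table standing in for two binary
# columns such as Smoker and HighBP (the numbers are made up)
tab <- matrix(c(1200, 800, 700, 1300), nrow = 2,
              dimnames = list(Smoker = c(0, 1), HighBP = c(0, 1)))

# chisq.test() tests whether the two categorical variables are independent
res <- chisq.test(tab)
res$p.value  # a small p-value suggests the variables are associated
```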

There are several correlation coefficient formulas, such as the sample correlation coefficient, the population correlation coefficient and rank correlation coefficients. The best known is the Pearson product-moment correlation, obtained by dividing the covariance of two variables by the product of their standard deviations; it can be computed for any dataset with a finite covariance matrix. Because most of our variables are binary or ordinal, the coefficient selected for this analysis is the Spearman rank correlation, which is simply the Pearson correlation computed on the ranks of the values.
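Note that the code in this section passes method = "spearman" to cor(): the Spearman rank correlation, which equals the Pearson correlation computed on the ranks of the values. A quick numerical sketch of that equivalence, on simulated data:

```r
set.seed(1)
x <- rnorm(100)
y <- x + rnorm(100)

# Spearman correlation is the Pearson correlation of the ranked values
s_direct <- cor(x, y, method = "spearman")
s_ranked <- cor(rank(x), rank(y), method = "pearson")

all.equal(s_direct, s_ranked)  # TRUE
```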

The following code computes the Spearman correlation between the numerical variables of the dataset and shows the obtained values.

cor_matrix <- cor(df_numeric, method = "spearman")

print("Correlation Matrix:")
[1] "Correlation Matrix:"
print(cor_matrix)
                     Diabetes_binary      HighBP    HighChol     CholCheck
Diabetes_binary           1.00000000  0.38151555  0.28921281  0.1153816171
HighBP                    0.38151555  1.00000000  0.31651485  0.1032832913
HighChol                  0.28921281  0.31651485  1.00000000  0.0859813978
CholCheck                 0.11538162  0.10328329  0.08598140  1.0000000000
BMI                       0.32835561  0.27554596  0.15833925  0.0543965663
Smoker                    0.08599896  0.08743830  0.09339831 -0.0043305151
Stroke                    0.12542678  0.12905987  0.09978619  0.0225293810
HeartDiseaseorAttack      0.21152340  0.21075039  0.18118664  0.0434971444
PhysActivity             -0.15866560 -0.13610217 -0.09045316 -0.0082493633
Fruits                   -0.05407656 -0.04085216 -0.04738362  0.0173838322
Veggies                  -0.07929315 -0.06662374 -0.04283626  0.0003492577
HvyAlcoholConsump        -0.09485314 -0.02702989 -0.02544298 -0.0271461602
AnyHealthcare             0.02319075  0.03576444  0.03153180  0.1068004249
NoDocbcCost               0.04097657  0.02651701  0.03319927 -0.0626687433
GenHlth                   0.41425866  0.32403793  0.23913786  0.0602571737
MentHlth                  0.05651530  0.03889704  0.07060833 -0.0169596205
PhysHlth                  0.21550736  0.17710138  0.14880130  0.0326525481
DiffWalk                  0.27264601  0.23478391  0.16204341  0.0444303666
Sex                       0.04441286  0.04081925  0.01732446 -0.0079912226
Age                       0.26161996  0.32217979  0.21797418  0.0992080572
Education                -0.16992567 -0.14354264 -0.08317736 -0.0076929007
Income                   -0.23252895 -0.19689594 -0.11028910  0.0072711196
Education_binary         -0.10231756 -0.07871780 -0.05001763 -0.0062569876
                              BMI       Smoker       Stroke
Diabetes_binary       0.328355613  0.085998964  0.125426785
HighBP                0.275545956  0.087438297  0.129059872
HighChol              0.158339249  0.093398312  0.099786191
CholCheck             0.054396566 -0.004330515  0.022529381
BMI                   1.000000000  0.022429012  0.026133378
Smoker                0.022429012  1.000000000  0.064658397
Stroke                0.026133378  0.064658397  1.000000000
HeartDiseaseorAttack  0.074474520  0.124417535  0.223393786
PhysActivity         -0.169264726 -0.079823258 -0.079984782
Fruits               -0.091424715 -0.074810805 -0.008996297
Veggies              -0.062669226 -0.029925651 -0.047601204
HvyAlcoholConsump    -0.063087867  0.077835011 -0.023394552
AnyHealthcare        -0.005136231 -0.012938633  0.006483591
NoDocbcCost           0.061142416  0.035798890  0.036198325
GenHlth               0.284265010  0.149959748  0.182517254
MentHlth              0.086931776  0.062012320  0.068190661
PhysHlth              0.166514023  0.101202841  0.150871563
DiffWalk              0.233130063  0.119788712  0.192265923
Sex                   0.033497623  0.112125182  0.003822095
Age                  -0.038140924  0.098822294  0.123544192
Education            -0.118629606 -0.152917687 -0.071888693
Income               -0.118986237 -0.112240530 -0.134486538
Education_binary     -0.045619597 -0.060857918 -0.055726301
                     HeartDiseaseorAttack PhysActivity       Fruits
Diabetes_binary                0.21152340 -0.158665605 -0.054076556
HighBP                         0.21075039 -0.136102169 -0.040852164
HighChol                       0.18118664 -0.090453163 -0.047383624
CholCheck                      0.04349714 -0.008249363  0.017383832
BMI                            0.07447452 -0.169264726 -0.091424715
Smoker                         0.12441753 -0.079823258 -0.074810805
Stroke                         0.22339379 -0.079984782 -0.008996297
HeartDiseaseorAttack           1.00000000 -0.098223332 -0.019435823
PhysActivity                  -0.09822333  1.000000000  0.133812914
Fruits                        -0.01943582  0.133812914  1.000000000
Veggies                       -0.03631473  0.149322340  0.238605287
HvyAlcoholConsump             -0.03712977  0.019110751 -0.033245681
AnyHealthcare                  0.01568748  0.027088786  0.029385153
NoDocbcCost                    0.03602872 -0.063301661 -0.045842586
GenHlth                        0.26881454 -0.270967210 -0.099558905
MentHlth                       0.05122788 -0.097845344 -0.060024855
PhysHlth                       0.18496138 -0.208012836 -0.050328104
DiffWalk                       0.23261057 -0.276868259 -0.050784243
Sex                            0.09816137  0.051752891 -0.088723101
Age                            0.22523062 -0.092685040  0.067777788
Education                     -0.09372717  0.193831150  0.104653290
Income                        -0.14892375  0.202370499  0.076731680
Education_binary              -0.06337927  0.100844116  0.042187139
                           Veggies HvyAlcoholConsump AnyHealthcare  NoDocbcCost
Diabetes_binary      -0.0792931456      -0.094853140   0.023190749  0.040976573
HighBP               -0.0666237364      -0.027029886   0.035764442  0.026517009
HighChol             -0.0428362633      -0.025442979   0.031531795  0.033199272
CholCheck             0.0003492577      -0.027146160   0.106800425 -0.062668743
BMI                  -0.0626692263      -0.063087867  -0.005136231  0.061142416
Smoker               -0.0299256507       0.077835011  -0.012938633  0.035798890
Stroke               -0.0476012036      -0.023394552   0.006483591  0.036198325
HeartDiseaseorAttack -0.0363147331      -0.037129769   0.015687481  0.036028720
PhysActivity          0.1493223396       0.019110751   0.027088786 -0.063301661
Fruits                0.2386052867      -0.033245681   0.029385153 -0.045842586
Veggies               1.0000000000       0.022090472   0.029152204 -0.037146195
HvyAlcoholConsump     0.0220904721       1.000000000  -0.013483953  0.009682557
AnyHealthcare         0.0291522037      -0.013483953   1.000000000 -0.221657573
NoDocbcCost          -0.0371461952       0.009682557  -0.221657573  1.000000000
GenHlth              -0.1172584145      -0.058740518  -0.034006408  0.166006719
MentHlth             -0.0407673316       0.024335944  -0.046540852  0.192201880
PhysHlth             -0.0609970471      -0.035878483  -0.005765732  0.168141140
DiffWalk             -0.0840715623      -0.049293977   0.008113322  0.127110921
Sex                  -0.0526037770       0.014164398  -0.006561785 -0.048186618
Age                  -0.0172075629      -0.056264226   0.141017473 -0.141049733
Education             0.1535388035       0.033426171   0.100534473 -0.094148431
Income                0.1558164176       0.066585569   0.129948474 -0.195719144
Education_binary      0.0857534413       0.030315004   0.076029518 -0.071632300
                         GenHlth    MentHlth     PhysHlth     DiffWalk
Diabetes_binary       0.41425866  0.05651530  0.215507356  0.272646006
HighBP                0.32403793  0.03889704  0.177101385  0.234783906
HighChol              0.23913786  0.07060833  0.148801297  0.162043410
CholCheck             0.06025717 -0.01695962  0.032652548  0.044430367
BMI                   0.28426501  0.08693178  0.166514023  0.233130063
Smoker                0.14995975  0.06201232  0.101202841  0.119788712
Stroke                0.18251725  0.06819066  0.150871563  0.192265923
HeartDiseaseorAttack  0.26881454  0.05122788  0.184961384  0.232610574
PhysActivity         -0.27096721 -0.09784534 -0.208012836 -0.276868259
Fruits               -0.09955890 -0.06002486 -0.050328104 -0.050784243
Veggies              -0.11725841 -0.04076733 -0.060997047 -0.084071562
HvyAlcoholConsump    -0.05874052  0.02433594 -0.035878483 -0.049293977
AnyHealthcare        -0.03400641 -0.04654085 -0.005765732  0.008113322
NoDocbcCost           0.16600672  0.19220188  0.168141140  0.127110921
GenHlth               1.00000000  0.26951341  0.524274479  0.466539876
MentHlth              0.26951341  1.00000000  0.345336929  0.224159145
PhysHlth              0.52427448  0.34533693  1.000000000  0.461953961
DiffWalk              0.46653988  0.22415915  0.461953961  1.000000000
Sex                  -0.01403767 -0.12812104 -0.070312080 -0.082248325
Age                   0.13875512 -0.16792884  0.044261547  0.182841621
Education            -0.28153156 -0.06789651 -0.137528550 -0.199614252
Income               -0.38233755 -0.17188337 -0.257596607 -0.336425444
Education_binary     -0.18452702 -0.05223709 -0.095200381 -0.140369342
                               Sex           Age    Education      Income
Diabetes_binary       0.0444128584  0.2616199574 -0.169925666 -0.23252895
HighBP                0.0408192533  0.3221797895 -0.143542637 -0.19689594
HighChol              0.0173244557  0.2179741750 -0.083177359 -0.11028910
CholCheck            -0.0079912226  0.0992080572 -0.007692901  0.00727112
BMI                   0.0334976227 -0.0381409244 -0.118629606 -0.11898624
Smoker                0.1121251819  0.0988222937 -0.152917687 -0.11224053
Stroke                0.0038220952  0.1235441922 -0.071888693 -0.13448654
HeartDiseaseorAttack  0.0981613669  0.2252306162 -0.093727165 -0.14892375
PhysActivity          0.0517528912 -0.0926850399  0.193831150  0.20237050
Fruits               -0.0887231011  0.0677777878  0.104653290  0.07673168
Veggies              -0.0526037770 -0.0172075629  0.153538803  0.15581642
HvyAlcoholConsump     0.0141643977 -0.0562642263  0.033426171  0.06658557
AnyHealthcare        -0.0065617851  0.1410174725  0.100534473  0.12994847
NoDocbcCost          -0.0481866180 -0.1410497328 -0.094148431 -0.19571914
GenHlth              -0.0140376652  0.1387551158 -0.281531564 -0.38233755
MentHlth             -0.1281210369 -0.1679288439 -0.067896513 -0.17188337
PhysHlth             -0.0703120801  0.0442615473 -0.137528550 -0.25759661
DiffWalk             -0.0822483253  0.1828416211 -0.199614252 -0.33642544
Sex                   1.0000000000  0.0004467795  0.046773685  0.15796710
Age                   0.0004467795  1.0000000000 -0.102378905 -0.17295027
Education             0.0467736846 -0.1023789049  1.000000000  0.46334862
Income                0.1579671041 -0.1729502656  0.463348619  1.00000000
Education_binary      0.0208576036 -0.0637010995  0.473632268  0.27570314
                     Education_binary
Diabetes_binary          -0.102317558
HighBP                   -0.078717804
HighChol                 -0.050017633
CholCheck                -0.006256988
BMI                      -0.045619597
Smoker                   -0.060857918
Stroke                   -0.055726301
HeartDiseaseorAttack     -0.063379273
PhysActivity              0.100844116
Fruits                    0.042187139
Veggies                   0.085753441
HvyAlcoholConsump         0.030315004
AnyHealthcare             0.076029518
NoDocbcCost              -0.071632300
GenHlth                  -0.184527017
MentHlth                 -0.052237090
PhysHlth                 -0.095200381
DiffWalk                 -0.140369342
Sex                       0.020857604
Age                      -0.063701100
Education                 0.473632268
Income                    0.275703138
Education_binary          1.000000000

As there are many values, a plot will be displayed to make interpretation easier. The selected plot is a heatmap. But how should it be interpreted? The correlation quantifies the sign and the magnitude of the tendency between two variables.

  1. The sign denotes the direction of the variable relationship.

    -> Values above 0 show a direct or positive relationship between the variables,

    -> Values below 0 show an indirect or negative relationship,

    -> A null value shows that there does not exist any tendency between both variables.

  2. The magnitude indicates the strength of the relationship. The closer the magnitude is to the extremes of the interval (1 or -1), the stronger the trend between the variables. As the correlation coefficient approaches zero, the trend weakens.

    -> If the correlation value takes the value 1 or -1, we will say that the correlation is “perfect”,

    -> If the correlation value takes the value 0, we will say that the variables are not correlated.
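These sign and magnitude cases can be checked directly with R's cor() on small toy vectors:

```r
x <- c(1, 2, 3, 4, 5)

cor(x,  2 * x + 1)        #  1: perfect positive (direct) relationship
cor(x, -3 * x + 10)       # -1: perfect negative (indirect) relationship
cor(x, c(2, 1, 3, 1, 2))  #  0: no linear tendency between the variables
```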

Now that we know how to interpret the correlation coefficients, let's see the plot and analyze it.

ggplot(data = as.data.frame(as.table(cor_matrix)), 
       aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", 
                       high = "red", 
                       mid = "white", 
                       midpoint = 0,
                       limit = c(-1, 1),
                       space = "Lab",
                       name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The redder the colour, the stronger the positive correlation; the bluer the colour, the stronger the negative correlation. At first glance it can be seen that the health-related variables, for example PhysHlth, GenHlth and DiffWalk, are positively correlated with each other, and that there are no pairs of variables with a particularly strong negative relationship.

But what does it mean to be positively or negatively correlated? Let's see it with some examples:

  • Positive correlations may include the relationship between beer sales and temperature, implying hot weather is related to an increase in cold beer consumption.

  • Negative correlations might look like the relationship between time spent partying before an exam and a student's exam scores.

  • Null correlations are relationships between variables that have nothing in common, such as the height of a person and their exam scores.

We will now check the highest values among the achieved correlations.

cor_data <- as.data.frame(as.table(cor_matrix))

strong_correlations <- cor_data |>
  filter(Freq > 0.35 | Freq < -0.35)

print("Strong Correlations:")
[1] "Strong Correlations:"
print(strong_correlations)
                   Var1                 Var2       Freq
1       Diabetes_binary      Diabetes_binary  1.0000000
2                HighBP      Diabetes_binary  0.3815155
3               GenHlth      Diabetes_binary  0.4142587
4       Diabetes_binary               HighBP  0.3815155
5                HighBP               HighBP  1.0000000
6              HighChol             HighChol  1.0000000
7             CholCheck            CholCheck  1.0000000
8                   BMI                  BMI  1.0000000
9                Smoker               Smoker  1.0000000
10               Stroke               Stroke  1.0000000
11 HeartDiseaseorAttack HeartDiseaseorAttack  1.0000000
12         PhysActivity         PhysActivity  1.0000000
13               Fruits               Fruits  1.0000000
14              Veggies              Veggies  1.0000000
15    HvyAlcoholConsump    HvyAlcoholConsump  1.0000000
16        AnyHealthcare        AnyHealthcare  1.0000000
17          NoDocbcCost          NoDocbcCost  1.0000000
18      Diabetes_binary              GenHlth  0.4142587
19              GenHlth              GenHlth  1.0000000
20             PhysHlth              GenHlth  0.5242745
21             DiffWalk              GenHlth  0.4665399
22               Income              GenHlth -0.3823375
23             MentHlth             MentHlth  1.0000000
24              GenHlth             PhysHlth  0.5242745
25             PhysHlth             PhysHlth  1.0000000
26             DiffWalk             PhysHlth  0.4619540
27              GenHlth             DiffWalk  0.4665399
28             PhysHlth             DiffWalk  0.4619540
29             DiffWalk             DiffWalk  1.0000000
30                  Sex                  Sex  1.0000000
31                  Age                  Age  1.0000000
32            Education            Education  1.0000000
33               Income            Education  0.4633486
34     Education_binary            Education  0.4736323
35              GenHlth               Income -0.3823375
36            Education               Income  0.4633486
37               Income               Income  1.0000000
38            Education     Education_binary  0.4736323
39     Education_binary     Education_binary  1.0000000

As mentioned before, most of the highest correlation values refer to relationships between health-related variables. The highest positive value is achieved between PhysHlth and GenHlth: the lower the value of PhysHlth, the lower the value of GenHlth (lower values of those variables are a good sign in this dataset), which makes a lot of sense. One thing we found curious is the correlation between GenHlth and Income. This pair has the most negative value, and, recalling what it means to be negatively correlated, it tells us that the higher the value of GenHlth (a value of 5 means that health is poor), the smaller the value of Income (a value of 1 indicates that the patient does not have much money). Thus, this could lead to the conclusion that people with more money are healthier.

Once this analysis has been carried out, we believe it is interesting to look specifically at the relationship between the remaining variables and the variable associated with diabetes. We will follow exactly the same process as before.

df_numeric <- df |>
  select_if(is.numeric) |>
  select(Diabetes_binary, everything())

cor_with_diabetes <- df_numeric |>
  cor(df_numeric$Diabetes_binary, method = "spearman")

After generating the correlation matrix restricted to the Diabetes_binary variable, we will generate a heatmap to take a look at the results.

plot_3 <- ggplot(data = as.data.frame(as.table(cor_with_diabetes)), 
       aes(x = Var2, 
           y = Var1, 
           fill = Freq)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", 
                       high = "red", 
                       mid = "white", 
                       midpoint = 0, 
                       limit = c(-1, 1), 
                       space = "Lab", 
                       name = "Correlation") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1))

plot_3

We will print the correlation values to get a better understanding of them.

print(cor_with_diabetes)
                            [,1]
Diabetes_binary       1.00000000
HighBP                0.38151555
HighChol              0.28921281
CholCheck             0.11538162
BMI                   0.32835561
Smoker                0.08599896
Stroke                0.12542678
HeartDiseaseorAttack  0.21152340
PhysActivity         -0.15866560
Fruits               -0.05407656
Veggies              -0.07929315
HvyAlcoholConsump    -0.09485314
AnyHealthcare         0.02319075
NoDocbcCost           0.04097657
GenHlth               0.41425866
MentHlth              0.05651530
PhysHlth              0.21550736
DiffWalk              0.27264601
Sex                   0.04441286
Age                   0.26161996
Education            -0.16992567
Income               -0.23252895
Education_binary     -0.10231756

Overall, we reach the same conclusions as before. Health-related variables are positively correlated with the diabetes variable. It is something that everybody would expect, but it was nice to confirm it. Age also has something to do with having diabetes or not, as does having difficulty walking. Those are really interesting facts, but which are the top three variables most correlated with diabetes?

cor_with_diabetes_data <- as.data.frame(as.table(cor_with_diabetes))

strong_correlations <- cor_with_diabetes_data |>
  filter(abs(Freq) > 0.30)  # keep correlations stronger than |0.30|

print("Strong Correlations:")
[1] "Strong Correlations:"
print(strong_correlations)
             Var1 Var2      Freq
1 Diabetes_binary    A 1.0000000
2          HighBP    A 0.3815155
3             BMI    A 0.3283556
4         GenHlth    A 0.4142587

To conclude the correlation analysis, we see that the top three most correlated variables with diabetes are (in order): GenHlth, HighBP and BMI. With these variables we could say that:

  1. Having a low value in the GenHlth variable (the lower the value, the healthier) goes together with a low value in Diabetes_binary (0 means no diabetes). Thus, better general health could be linked to not having diabetes.

  2. Having a low value in the HighBP variable (0 means no high blood pressure) goes together with a low value in Diabetes_binary. Thus, lower blood pressure could be linked to not having diabetes.

  3. Having a low value in the BMI variable (low values indicate a lower body mass) goes together with a low value in Diabetes_binary. Thus, a lower BMI could be linked to not having diabetes.
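
The top-three ranking above can also be obtained programmatically by sorting the absolute correlations (a small sketch, assuming cor_with_diabetes_data as created earlier):

```r
# Rank variables by absolute Spearman correlation with Diabetes_binary,
# dropping the trivial self-correlation of the target with itself.
cor_with_diabetes_data |>
  filter(Var1 != "Diabetes_binary") |>
  arrange(desc(abs(Freq))) |>
  head(3)
```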

Generalized Linear Model (GLM)

In this section of the analysis we will work on two main tasks. The first one will be the creation and interpretation of a GLM with all the numerical variables available in the dataset. The second task will be related to applying some methods to improve this model and to check which are the most valuable features for the model.

Having said that, a Generalized Linear Model (GLM) is a statistical model used to analyze relationships between variables, accommodating various types of data and allowing for the prediction of one variable based on others. In our case, the data that we will use is the dataset formed by all the numerical variables. We want to predict whether a patient will have diabetes or not using all the available numerical variables. Finally, family = binomial indicates that we are fitting a model for binary or binomial response variables. This is suitable when the outcome variable is binary, meaning it has only two possible outcomes, such as success/failure, yes/no, or 0/1. In this case, our outcome variable is ‘Diabetes_binary’ which only has 0 (non-diabetic) and 1 (diabetic).
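
To make the role of family = binomial concrete: the linear predictor is mapped onto a probability through the logistic (inverse-logit) function, available in base R as plogis(). A purely illustrative sketch (the eta values here are made up):

```r
# p = 1 / (1 + exp(-eta)): the inverse-logit link used by family = binomial
eta <- c(-2, 0, 2)  # illustrative linear-predictor values
plogis(eta)         # maps them onto the (0, 1) probability scale
```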

Let's create the model and interpret its results.

library("broom")  # tidy() and glance() come from broom, which is not attached by tidyverse

model_all <- glm(data = df_numeric,
                Diabetes_binary ~ .,
                family = binomial)
tidy_summary <- tidy(model_all)
glance_summary <- glance(model_all)

full_summary <- bind_cols(tidy_summary, glance_summary)

print(full_summary)
# A tibble: 23 × 13
   term     estimate std.error statistic   p.value null.deviance df.null  logLik
   <chr>       <dbl>     <dbl>     <dbl>     <dbl>         <dbl>   <int>   <dbl>
 1 (Interc… -6.87      0.125     -55.1   0                98000.   70691 -36194.
 2 HighBP    0.735     0.0197     37.3   9.85e-304        98000.   70691 -36194.
 3 HighChol  0.587     0.0189     31.1   7.02e-213        98000.   70691 -36194.
 4 CholChe…  1.36      0.0813     16.7   7.15e- 63        98000.   70691 -36194.
 5 BMI       0.0756    0.00157    48.0   0                98000.   70691 -36194.
 6 Smoker   -0.00192   0.0189     -0.101 9.19e-  1        98000.   70691 -36194.
 7 Stroke    0.162     0.0409      3.96  7.42e-  5        98000.   70691 -36194.
 8 HeartDi…  0.253     0.0284      8.88  6.42e- 19        98000.   70691 -36194.
 9 PhysAct… -0.0329    0.0213     -1.55  1.22e-  1        98000.   70691 -36194.
10 Fruits   -0.0344    0.0196     -1.75  7.95e-  2        98000.   70691 -36194.
# ℹ 13 more rows
# ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
#   nobs <int>

Printing the tidied summary gives us an overview of the fitted model. We will take a look at the most important aspects:

  1. p-values: Low p-values (typically below 0.05) suggest that the corresponding predictor variable is statistically significant in predicting the outcome (in the classic summary() printout these variables are flagged with the *** symbol). Most of them are the ones related to health (as mentioned in the correlation analysis). It seems interesting that being a smoker or not is not significant for the diabetes outcome. However, it looks like Sex and Income are quite relevant for the outcome.

  2. Deviance and Residual Deviance: Deviance is a measure of model fit. We compare the deviance of our model to a null model (a model with no predictors) to assess how well our model explains the variability in the data. Lower deviance indicates a better fit and it can be seen that our model achieves that.

  3. AIC (Akaike Information Criterion): AIC is a measure of model performance that penalizes for the number of parameters in the model. Lower AIC values indicate a better trade-off between model complexity and fit.
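
The AIC definition can be checked by hand from the fitted model (a quick sketch, assuming model_all from above; for a binomial GLM no dispersion parameter is estimated, so the parameter count is just the number of coefficients):

```r
# AIC = -2 * log-likelihood + 2 * (number of estimated parameters)
k <- length(coef(model_all))
aic_by_hand <- -2 * as.numeric(logLik(model_all)) + 2 * k
all.equal(aic_by_hand, AIC(model_all))
```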

Now that we have our model with all the variables, we will try to improve it by applying the step() method, which selects a subset of predictors to include in a model. The process adds or removes variables one at a time, based on a statistical criterion (step() uses AIC), to improve the model's fit or simplicity. We will use two different approaches:

  • forward: Starts with no predictors and adds them one at a time, selecting the one that improves the model the most at each step.

  • backward: Starts with all predictors and removes them one at a time, excluding the one that contributes the least at each step.

Both methods aim to find an optimal subset of predictors for a model.
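
One caveat about the calls below: step() with direction = "forward" starting from the already-full model has nothing left to add, so it simply returns the full model (which is why its AIC turns out identical to the full model's). A forward search that can actually grow would start from the intercept-only model with the full formula as the upper scope; a sketch, assuming model_all and df_numeric as defined above:

```r
# Forward selection started from the null model, so terms can be added
null_model <- glm(Diabetes_binary ~ 1,
                  data = df_numeric,
                  family = binomial)

forward_from_null <- step(null_model,
                          scope = list(lower = formula(null_model),
                                       upper = formula(model_all)),
                          direction = "forward",
                          trace = 0)  # trace = 0 suppresses the step-by-step log
```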

forward_model <- step(model_all, 
                      direction = "forward")
Start:  AIC=72433.85
Diabetes_binary ~ HighBP + HighChol + CholCheck + BMI + Smoker + 
    Stroke + HeartDiseaseorAttack + PhysActivity + Fruits + Veggies + 
    HvyAlcoholConsump + AnyHealthcare + NoDocbcCost + GenHlth + 
    MentHlth + PhysHlth + DiffWalk + Sex + Age + Education + 
    Income + Education_binary
backward_model <- step(model_all,
                       direction = "backward")
Start:  AIC=72433.85
Diabetes_binary ~ HighBP + HighChol + CholCheck + BMI + Smoker + 
    Stroke + HeartDiseaseorAttack + PhysActivity + Fruits + Veggies + 
    HvyAlcoholConsump + AnyHealthcare + NoDocbcCost + GenHlth + 
    MentHlth + PhysHlth + DiffWalk + Sex + Age + Education + 
    Income + Education_binary

                       Df Deviance   AIC
- Smoker                1    72388 72432
- Education_binary      1    72388 72432
- NoDocbcCost           1    72388 72432
- AnyHealthcare         1    72389 72433
<none>                       72388 72434
- PhysActivity          1    72390 72434
- Fruits                1    72391 72435
- Veggies               1    72395 72439
- Education             1    72398 72442
- MentHlth              1    72399 72443
- Stroke                1    72404 72448
- DiffWalk              1    72408 72452
- PhysHlth              1    72437 72481
- HeartDiseaseorAttack  1    72468 72512
- Income                1    72516 72560
- Sex                   1    72583 72627
- HvyAlcoholConsump     1    72636 72680
- CholCheck             1    72726 72770
- HighChol              1    73352 73396
- HighBP                1    73764 73808
- Age                   1    73979 74023
- BMI                   1    74997 75041
- GenHlth               1    75131 75175

Step:  AIC=72431.86
Diabetes_binary ~ HighBP + HighChol + CholCheck + BMI + Stroke + 
    HeartDiseaseorAttack + PhysActivity + Fruits + Veggies + 
    HvyAlcoholConsump + AnyHealthcare + NoDocbcCost + GenHlth + 
    MentHlth + PhysHlth + DiffWalk + Sex + Age + Education + 
    Income + Education_binary

                       Df Deviance   AIC
- Education_binary      1    72388 72430
- NoDocbcCost           1    72388 72430
- AnyHealthcare         1    72390 72432
<none>                       72388 72432
- PhysActivity          1    72390 72432
- Fruits                1    72391 72433
- Veggies               1    72395 72437
- Education             1    72398 72440
- MentHlth              1    72399 72441
- Stroke                1    72404 72446
- DiffWalk              1    72408 72450
- PhysHlth              1    72437 72479
- HeartDiseaseorAttack  1    72468 72510
- Income                1    72516 72558
- Sex                   1    72586 72628
- HvyAlcoholConsump     1    72639 72681
- CholCheck             1    72726 72768
- HighChol              1    73353 73395
- HighBP                1    73764 73806
- Age                   1    73983 74025
- BMI                   1    75001 75043
- GenHlth               1    75133 75175

Step:  AIC=72430.01
Diabetes_binary ~ HighBP + HighChol + CholCheck + BMI + Stroke + 
    HeartDiseaseorAttack + PhysActivity + Fruits + Veggies + 
    HvyAlcoholConsump + AnyHealthcare + NoDocbcCost + GenHlth + 
    MentHlth + PhysHlth + DiffWalk + Sex + Age + Education + 
    Income

                       Df Deviance   AIC
- NoDocbcCost           1    72388 72428
- AnyHealthcare         1    72390 72430
<none>                       72388 72430
- PhysActivity          1    72390 72430
- Fruits                1    72391 72431
- Veggies               1    72395 72435
- MentHlth              1    72400 72440
- Education             1    72401 72441
- Stroke                1    72404 72444
- DiffWalk              1    72408 72448
- PhysHlth              1    72437 72477
- HeartDiseaseorAttack  1    72468 72508
- Income                1    72516 72556
- Sex                   1    72586 72626
- HvyAlcoholConsump     1    72639 72679
- CholCheck             1    72726 72766
- HighChol              1    73353 73393
- HighBP                1    73765 73805
- Age                   1    73983 74023
- BMI                   1    75004 75044
- GenHlth               1    75133 75173

Step:  AIC=72428.32
Diabetes_binary ~ HighBP + HighChol + CholCheck + BMI + Stroke + 
    HeartDiseaseorAttack + PhysActivity + Fruits + Veggies + 
    HvyAlcoholConsump + AnyHealthcare + GenHlth + MentHlth + 
    PhysHlth + DiffWalk + Sex + Age + Education + Income

                       Df Deviance   AIC
- AnyHealthcare         1    72390 72428
<none>                       72388 72428
- PhysActivity          1    72391 72429
- Fruits                1    72391 72429
- Veggies               1    72395 72433
- MentHlth              1    72400 72438
- Education             1    72401 72439
- Stroke                1    72404 72442
- DiffWalk              1    72408 72446
- PhysHlth              1    72437 72475
- HeartDiseaseorAttack  1    72469 72507
- Income                1    72520 72558
- Sex                   1    72586 72624
- HvyAlcoholConsump     1    72639 72677
- CholCheck             1    72726 72764
- HighChol              1    73354 73392
- HighBP                1    73766 73804
- Age                   1    74007 74045
- BMI                   1    75004 75042
- GenHlth               1    75142 75180

Step:  AIC=72427.78
Diabetes_binary ~ HighBP + HighChol + CholCheck + BMI + Stroke + 
    HeartDiseaseorAttack + PhysActivity + Fruits + Veggies + 
    HvyAlcoholConsump + GenHlth + MentHlth + PhysHlth + DiffWalk + 
    Sex + Age + Education + Income

                       Df Deviance   AIC
<none>                       72390 72428
- PhysActivity          1    72392 72428
- Fruits                1    72393 72429
- Veggies               1    72397 72433
- MentHlth              1    72401 72437
- Education             1    72402 72438
- Stroke                1    72406 72442
- DiffWalk              1    72410 72446
- PhysHlth              1    72438 72474
- HeartDiseaseorAttack  1    72470 72506
- Income                1    72520 72556
- Sex                   1    72587 72623
- HvyAlcoholConsump     1    72641 72677
- CholCheck             1    72734 72770
- HighChol              1    73356 73392
- HighBP                1    73768 73804
- Age                   1    74053 74089
- BMI                   1    75006 75042
- GenHlth               1    75142 75178
cat("\nForward Selection Model Summary:\n")

Forward Selection Model Summary:
tidy_summary_for <- tidy(forward_model)
glance_summary_for <- glance(forward_model)
full_summary_for <- bind_cols(tidy_summary_for, glance_summary_for)
print(full_summary_for)
# A tibble: 23 × 13
   term     estimate std.error statistic   p.value null.deviance df.null  logLik
   <chr>       <dbl>     <dbl>     <dbl>     <dbl>         <dbl>   <int>   <dbl>
 1 (Interc… -6.87      0.125     -55.1   0                98000.   70691 -36194.
 2 HighBP    0.735     0.0197     37.3   9.85e-304        98000.   70691 -36194.
 3 HighChol  0.587     0.0189     31.1   7.02e-213        98000.   70691 -36194.
 4 CholChe…  1.36      0.0813     16.7   7.15e- 63        98000.   70691 -36194.
 5 BMI       0.0756    0.00157    48.0   0                98000.   70691 -36194.
 6 Smoker   -0.00192   0.0189     -0.101 9.19e-  1        98000.   70691 -36194.
 7 Stroke    0.162     0.0409      3.96  7.42e-  5        98000.   70691 -36194.
 8 HeartDi…  0.253     0.0284      8.88  6.42e- 19        98000.   70691 -36194.
 9 PhysAct… -0.0329    0.0213     -1.55  1.22e-  1        98000.   70691 -36194.
10 Fruits   -0.0344    0.0196     -1.75  7.95e-  2        98000.   70691 -36194.
# ℹ 13 more rows
# ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
#   nobs <int>
cat("\nBackward Selection Model Summary:\n")

Backward Selection Model Summary:
tidy_summary_back <- tidy(backward_model)
glance_summary_back <- glance(backward_model)
full_summary_back <- bind_cols(tidy_summary_back, glance_summary_back)
print(full_summary_back)
# A tibble: 19 × 13
   term     estimate std.error statistic   p.value null.deviance df.null  logLik
   <chr>       <dbl>     <dbl>     <dbl>     <dbl>         <dbl>   <int>   <dbl>
 1 (Interc… -6.82      0.119      -57.3  0                98000.   70691 -36195.
 2 HighBP    0.736     0.0197      37.3  4.53e-304        98000.   70691 -36195.
 3 HighChol  0.588     0.0188      31.2  2.14e-213        98000.   70691 -36195.
 4 CholChe…  1.37      0.0810      16.9  7.10e- 64        98000.   70691 -36195.
 5 BMI       0.0756    0.00157     48.1  0                98000.   70691 -36195.
 6 Stroke    0.162     0.0409       3.96 7.35e-  5        98000.   70691 -36195.
 7 HeartDi…  0.253     0.0284       8.90 5.80e- 19        98000.   70691 -36195.
 8 PhysAct… -0.0326    0.0213      -1.53 1.25e-  1        98000.   70691 -36195.
 9 Fruits   -0.0346    0.0196      -1.77 7.74e-  2        98000.   70691 -36195.
10 Veggies  -0.0611    0.0233      -2.62 8.85e-  3        98000.   70691 -36195.
11 HvyAlco… -0.751     0.0486     -15.5  7.42e- 54        98000.   70691 -36195.
12 GenHlth   0.585     0.0114      51.1  0                98000.   70691 -36195.
13 MentHlth -0.00432   0.00128     -3.38 7.24e-  4        98000.   70691 -36195.
14 PhysHlth -0.00827   0.00119     -6.94 3.83e- 12        98000.   70691 -36195.
15 DiffWalk  0.116     0.0258       4.50 6.93e-  6        98000.   70691 -36195.
16 Sex       0.266     0.0190      14.0  1.28e- 44        98000.   70691 -36195.
17 Age       0.153     0.00383     39.8  0                98000.   70691 -36195.
18 Educati… -0.0360    0.0102      -3.54 4.05e-  4        98000.   70691 -36195.
19 Income   -0.0584    0.00513    -11.4  4.68e- 30        98000.   70691 -36195.
# ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
#   nobs <int>

After printing their summaries, we will select the model with the lowest AIC.

AIC_full <- AIC(model_all)

AIC_forward <- AIC(forward_model)

AIC_backward <- AIC(backward_model)

cat("AIC for Full Model:", AIC_full, "\n")
AIC for Full Model: 72433.85 
cat("AIC for Forward Model:", AIC_forward, "\n")
AIC for Forward Model: 72433.85 
cat("AIC for Backward Model:", AIC_backward, "\n")
AIC for Backward Model: 72427.78 

It can be seen that the model with the lowest AIC is the one created by backward selection. As we are intrigued by it, we will check how many variables it contains and which ones they are.

all_variables <- names(coef(model_all))

selected_variables <- names(coef(backward_model)[coef(backward_model) != 0])

not_selected_variables <- setdiff(all_variables, selected_variables)

count_selected_variables <- length(selected_variables)

cat("Number of Variables selected by Backward Model:", count_selected_variables, "\n\n")
Number of Variables selected by Backward Model: 19 
cat("Selected Variables:", paste(selected_variables, collapse = ", "), "\n\n")
Selected Variables: (Intercept), HighBP, HighChol, CholCheck, BMI, Stroke, HeartDiseaseorAttack, PhysActivity, Fruits, Veggies, HvyAlcoholConsump, GenHlth, MentHlth, PhysHlth, DiffWalk, Sex, Age, Education, Income 
cat("Variables in 'model_all' but not selected by Backward Model:", paste(not_selected_variables, collapse = ", "), "\n")
Variables in 'model_all' but not selected by Backward Model: Smoker, AnyHealthcare, NoDocbcCost, Education_binary 

The backward model keeps 19 terms (including the intercept) out of the 23 used by the full model. There are four variables that the optimal model does not take into account: Smoker, AnyHealthcare, NoDocbcCost and Education_binary. This is something we could have predicted, because if we take a look at the summary of the model with all the variables, these four variables are the ones with the highest p-values. Thus, they are the least significant in the first model.

Saving Plots

plot_3 |> 
  ggsave(filename = "05_key_plot_3.png", path = "../results/")
Saving 7 x 5 in image
library("tidyverse")
library("broom")

Load data

df <- read_tsv("../data/03_dat_aug.tsv")
Rows: 70692 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr  (9): Smoking_Status, Diabetes_Status, Sex_character, Age_Range, Income_...
dbl (23): Diabetes_binary, HighBP, HighChol, CholCheck, BMI, Smoker, Stroke,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
# A tibble: 6 × 32
  Diabetes_binary HighBP HighChol CholCheck   BMI Smoker Stroke
            <dbl>  <dbl>    <dbl>     <dbl> <dbl>  <dbl>  <dbl>
1               0      1        0         1    26      0      0
2               0      1        1         1    26      1      1
3               0      0        0         1    26      0      0
4               0      1        1         1    28      1      0
5               0      0        0         1    29      1      0
6               0      0        0         1    18      0      0
# ℹ 25 more variables: HeartDiseaseorAttack <dbl>, PhysActivity <dbl>,
#   Fruits <dbl>, Veggies <dbl>, HvyAlcoholConsump <dbl>, AnyHealthcare <dbl>,
#   NoDocbcCost <dbl>, GenHlth <dbl>, MentHlth <dbl>, PhysHlth <dbl>,
#   DiffWalk <dbl>, Sex <dbl>, Age <dbl>, Education <dbl>, Income <dbl>,
#   Smoking_Status <chr>, Diabetes_Status <chr>, Sex_character <chr>,
#   Age_Range <chr>, Income_Class <chr>, Physically_Active <chr>, Habits <chr>,
#   Health_Risk <chr>, Education_binary <dbl>, SE_Background <chr>

The second script of the analysis is about performing PCA and building some models with different variations. Three models will be created. The first will use the components needed to reach 80% of the variability after the PCA analysis, combined with logistic regression for prediction. The second and third will be separate GLMs for men and women. Thanks to this we will be able to analyze whether gender has a special importance when predicting patients with diabetes.

PCA

It has to be mentioned that we have used the code provided by the lecturers (PCA tidyverse) and applied it to our own data. While the code is therefore similar, the conclusions achieved are completely different. We want to highlight that, thanks to this code, we have been able to understand how Principal Component Analysis (PCA) works and how to interpret the obtained results. Let's start with the PCA.

PCA is a dimensionality reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set. PCA can be used for many purposes, however, we have opted to apply this technique to create models with less variables to try to gain more meaningful conclusions of them. Thus, we are using it to reduce the dimensionality to simplify our data set with many variables, to make it more manageable for analysis.

Having said that, the first thing to do is to work only with the numerical variables. PCA is typically applied to continuous variables and assumes that the variables are on a numerical scale. It is based on the covariance matrix, which involves computing variances and covariances between numeric variables; therefore, directly applying PCA to a dataset with categorical variables is not appropriate. However, there are techniques that extend PCA to handle categorical variables. One such technique is Multiple Correspondence Analysis (MCA), an extension of PCA for categorical data that is suitable for datasets where variables are categorical and can take more than two levels. As we are only keeping the numerical variables here, we will set this idea aside.

column_classes <- df |>
  map_chr(class)

numeric_columns <- names(which(column_classes == "numeric")) |>
  print()
 [1] "Diabetes_binary"      "HighBP"               "HighChol"            
 [4] "CholCheck"            "BMI"                  "Smoker"              
 [7] "Stroke"               "HeartDiseaseorAttack" "PhysActivity"        
[10] "Fruits"               "Veggies"              "HvyAlcoholConsump"   
[13] "AnyHealthcare"        "NoDocbcCost"          "GenHlth"             
[16] "MentHlth"             "PhysHlth"             "DiffWalk"            
[19] "Sex"                  "Age"                  "Education"           
[22] "Income"               "Education_binary"    

After checking which variables of our dataset are numerical, we will select only those to perform the analysis. Before using PCA, we also have to make sure that all our data is playing on a level field: scaling means adjusting the values so they are all in a similar range.

pca_fit <- df |> 
  select(where(is.numeric)) |>  # keep only the numeric columns
  prcomp(scale = TRUE)          # scale = TRUE standardizes every variable first

Now, we will create a special kind of map for our data using PCA. To do this, we blend the PCA results with the original data, adding back the information we temporarily set aside. It’s like bringing back the colors to our points based on categories that were there in the first place but were temporarily taken away for PCA. We use a tool called augment() from the “broom” package to make this happen. It needs the model we created and the original data as inputs.

df_augmented <- augment(pca_fit, df)

ggplot(df_augmented, aes(.fittedPC1, .fittedPC2, color = Diabetes_Status)) + 
  geom_point(size = 3, alpha = 0.7, shape = 16) +
  scale_color_manual( values = c("Non-Diabetic" = "#FF5733", "Diabetic" = "#33FF57")) +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        panel.background = element_rect(fill = "lightgray"))

It can be noticed that the difference between having diabetes or not is distinguished quite well in the first two components. We will continue by creating the rotation matrix. In PCA this matrix is like a set of instructions that turns our data points into a new set of directions, highlighting the most important information in the data. We can think of it as finding the best view to understand patterns in the data.

pca_fit |>
  tidy(matrix = "rotation")
# A tibble: 529 × 3
   column             PC   value
   <chr>           <dbl>   <dbl>
 1 Diabetes_binary     1  0.282 
 2 Diabetes_binary     2 -0.250 
 3 Diabetes_binary     3  0.0348
 4 Diabetes_binary     4  0.0538
 5 Diabetes_binary     5 -0.192 
 6 Diabetes_binary     6 -0.229 
 7 Diabetes_binary     7  0.0411
 8 Diabetes_binary     8 -0.0303
 9 Diabetes_binary     9 -0.0329
10 Diabetes_binary    10  0.0256
# ℹ 519 more rows

Finally we will look at the variance explained by each PC. To do this, we will compute the eigenvalues. But, what is an eigenvalue representing? An eigenvalue in PCA represents the amount of variance captured by a principal component. It tells us how much information that component holds: larger eigenvalues mean more important components.
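
The eigenvalues can also be read directly off the prcomp object: each one is simply the square of a component's standard deviation (a small sketch, assuming pca_fit as fitted above):

```r
# Eigenvalue of PC_i = variance of PC_i = sdev_i^2
eigenvalues <- pca_fit$sdev^2
head(eigenvalues / sum(eigenvalues))  # proportion of total variance per PC
```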

pca_fit |>
  tidy(matrix = "eigenvalues")
# A tibble: 23 × 4
      PC std.dev percent cumulative
   <dbl>   <dbl>   <dbl>      <dbl>
 1     1   2.00   0.175       0.175
 2     2   1.36   0.0807      0.255
 3     3   1.22   0.0646      0.320
 4     4   1.15   0.0571      0.377
 5     5   1.10   0.0526      0.430
 6     6   1.06   0.0488      0.478
 7     7   1.02   0.0451      0.523
 8     8   0.989  0.0426      0.566
 9     9   0.970  0.0409      0.607
10    10   0.939  0.0383      0.645
# ℹ 13 more rows

Here we can see those results in a plot.

pca_fit |>
  tidy(matrix = "eigenvalues") |>
  ggplot(aes(PC, percent)) +
  geom_col(fill = "#66c2a5", alpha = 0.7) +  # geom_col() is geom_bar(stat = "identity")
  scale_x_continuous(breaks = 1:23) +        # one break per principal component
  theme_minimal()

Having reached this part of the project, we now want to select only a certain number of components to be used in our models. Selecting the number of components in PCA depends on our goals and the amount of information we are willing to retain. A common approach is to choose enough components to capture a significant portion of the total variance, often aiming for around 70-80%. This is of course a trade-off: the more components we keep, the more information we retain, but also the more noise we introduce.

To help us find the desired number of components, we will create a scree plot and select the components whose cumulative sum reaches 80% of the variance.

cumulative_var <- cumsum(pca_fit$sdev^2) / sum(pca_fit$sdev^2)

plot_data <- data.frame(
  PC = seq_along(cumulative_var),
  Cumulative_Proportion = cumulative_var
)

plot_4 <- ggplot(plot_data, aes(x = PC, y = Cumulative_Proportion)) +
  geom_line() +
  geom_point() +
  labs(
    x = "Number of Principal Components",
    y = "Cumulative Proportion of Variance Explained"
  ) +
  geom_hline(yintercept = 0.7, linetype = "dashed", color = "red") +
  geom_hline(yintercept = 0.75, linetype = "dashed", color = "blue") +
  geom_hline(yintercept = 0.8, linetype = "dashed", color = "green")

plot_4

Looking at the results of the plot, we will select a total of 15 components.
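
The cut-off can also be computed directly instead of read off the plot (a sketch, assuming cumulative_var from the chunk above):

```r
# Index of the first component at which the cumulative variance reaches 80 %
n_components <- which(cumulative_var >= 0.80)[1]
n_components
```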

Prediction model

After performing PCA and selecting the appropriate number of components, we will build a simple logistic regression model for prediction tasks. In our case, the target variable will be Diabetes_binary and the following procedure will be applied:

  1. Dimensionality reduction: We will select 15 principal components from the results of our PCA analysis (pca_fit).

  2. Data splitting: The dataset is divided into training and testing sets (70% training, 30% testing) using random indices.

  3. Model training: A logistic regression model (glm) is trained on the training data, where the outcome variable is binary (Diabetes_binary).

  4. Prediction and evaluation: The model is used to predict outcomes on the test data. Predictions are converted to binary (0 or 1) based on a threshold of 0.5. To evaluate the model’s performance, a confusion matrix is created and accuracy is computed as well.

num_components <- 15

selected_components <- as.data.frame(pca_fit$x[, 1:num_components])

set.seed(123) 
train_indices <- sample(seq_len(nrow(df)), 0.7 * nrow(df))

train_data <- selected_components |>
  slice(train_indices)

train_outcome <- df$Diabetes_binary[train_indices]

test_data <- selected_components |>
  slice(-train_indices)

test_outcome <- df$Diabetes_binary[-train_indices]

model <- glm(train_outcome ~ ., family = "binomial", data = cbind(train_outcome, train_data))

predictions <- predict(model, newdata = as.data.frame(test_data), type = "response")

binary_predictions <- ifelse(predictions > 0.5, 1, 0)

conf_matrix <- table(Predicted = binary_predictions, Actual = test_outcome)

accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)

conf_matrix |>
  print()
         Actual
Predicted    0    1
        0 9179 1316
        1 1430 9283
cat("Accuracy:", accuracy, "\n")
Accuracy: 0.8705206 

One of the methods used to check the model's performance is a confusion matrix, a table that helps evaluate a classification model. It compares the predicted values to the actual values in a dataset and consists of four components (true and false positives and negatives). Those components are then used to compute other metrics such as the accuracy we printed. The results obtained with the selected 15 components are quite good, as we achieve an accuracy of about 87%.
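
Accuracy is only one of the metrics the confusion matrix supports; sensitivity and specificity come from the same four cells (a sketch, assuming conf_matrix with rows = Predicted and columns = Actual, in 0/1 order as printed above):

```r
TP <- conf_matrix["1", "1"]  # diabetic, predicted diabetic
TN <- conf_matrix["0", "0"]  # non-diabetic, predicted non-diabetic
FP <- conf_matrix["1", "0"]  # non-diabetic, predicted diabetic
FN <- conf_matrix["0", "1"]  # diabetic, predicted non-diabetic

sensitivity <- TP / (TP + FN)  # true-positive rate
specificity <- TN / (TN + FP)  # true-negative rate
```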

Men VS. Women (GLM)

The last part of the analysis is focused on detecting whether a GLM works better for men or for women. We want to help tackle gender bias in science: data might not treat men and women fairly, for example if there are not enough women included in studies, or if the way data is collected or analyzed favors one gender over the other. Fixing this means making sure studies include a good mix of men and women and treating everyone equally in how the data is handled.

First of all we will take a look at the distribution between men and women in our dataset.

df |> 
  count(Sex)
# A tibble: 2 × 2
    Sex     n
  <dbl> <int>
1     0 38386
2     1 32306

A value of 0 means women and a value of 1 means men, so according to our data we have slightly more women than men. Let's generate two separate datasets according to gender and apply logistic regression to compare their performances.

women_data <- df |>
  filter(Sex == 0) |>
  select(where(is.numeric))

men_data <- df |>
  filter(Sex == 1) |>
  select(where(is.numeric))

Let’s first fit the model on the women’s data.

library("broom")

model_women <- glm(data = women_data,
                Diabetes_binary ~ .,
                family = binomial)

tidy_summary_women <- tidy(model_women)
glance_summary_women <- glance(model_women)
full_summary_women <- bind_cols(tidy_summary_women, glance_summary_women)
print(full_summary_women)
# A tibble: 23 × 13
   term     estimate std.error statistic   p.value null.deviance df.null  logLik
   <chr>       <dbl>     <dbl>     <dbl>     <dbl>         <dbl>   <int>   <dbl>
 1 (Interc…  -6.52     0.177      -36.9  2.69e-297        53151.   38385 -18999.
 2 HighBP     0.809    0.0275      29.4  1.03e-189        53151.   38385 -18999.
 3 HighChol   0.664    0.0262      25.4  6.67e-142        53151.   38385 -18999.
 4 CholChe…   1.36     0.119       11.4  2.56e- 30        53151.   38385 -18999.
 5 BMI        0.0760   0.00203     37.4  8.31e-306        53151.   38385 -18999.
 6 Smoker    -0.0355   0.0262      -1.36 1.75e-  1        53151.   38385 -18999.
 7 Stroke     0.192    0.0562       3.42 6.26e-  4        53151.   38385 -18999.
 8 HeartDi…   0.245    0.0435       5.64 1.73e-  8        53151.   38385 -18999.
 9 PhysAct…  -0.0333   0.0289      -1.15 2.49e-  1        53151.   38385 -18999.
10 Fruits     0.0332   0.0278       1.20 2.32e-  1        53151.   38385 -18999.
# ℹ 13 more rows
# ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
#   nobs <int>
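When scanning a summary like this, it can help to keep only the clearly significant terms. A small sketch of the pattern using base R, shown on a simulated model (the hypothetical `d`, `x1`, `x2` are illustration only; the same filtering applies to `model_women` above):

```r
# Simulated stand-in model: y depends on x1 but not on x2
set.seed(1)
d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
d$y <- rbinom(300, 1, plogis(1.2 * d$x1))
m <- glm(y ~ x1 + x2, data = d, family = binomial)

# Keep only coefficients with p-value below 0.05
coefs <- summary(m)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, , drop = FALSE]
```

On the real model this would shrink the 23-row table to the handful of predictors, such as HighBP, HighChol and BMI, that drive the fit.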

Now let’s fit the same model on the men’s data.

model_men <- glm(data = men_data,
                Diabetes_binary ~ .,
                family = binomial)

tidy_summary_men <- tidy(model_men)
glance_summary_men <- glance(model_men)
full_summary_men <- bind_cols(tidy_summary_men, glance_summary_men)
print(full_summary_men)
# A tibble: 23 × 13
   term     estimate std.error statistic   p.value null.deviance df.null  logLik
   <chr>       <dbl>     <dbl>     <dbl>     <dbl>         <dbl>   <int>   <dbl>
 1 (Interc…  -6.83     0.177     -38.5   0                44710.   32305 -17109.
 2 HighBP     0.676    0.0285     23.7   8.04e-124        44710.   32305 -17109.
 3 HighChol   0.516    0.0274     18.8   3.98e- 79        44710.   32305 -17109.
 4 CholChe…   1.33     0.111      12.0   4.45e- 33        44710.   32305 -17109.
 5 BMI        0.0722   0.00250    28.8   6.66e-183        44710.   32305 -17109.
 6 Smoker     0.0232   0.0276      0.840 4.01e-  1        44710.   32305 -17109.
 7 Stroke     0.124    0.0598      2.08  3.78e-  2        44710.   32305 -17109.
 8 HeartDi…   0.241    0.0377      6.40  1.52e- 10        44710.   32305 -17109.
 9 PhysAct…  -0.0357   0.0316     -1.13  2.60e-  1        44710.   32305 -17109.
10 Fruits    -0.105    0.0277     -3.77  1.61e-  4        44710.   32305 -17109.
# ℹ 13 more rows
# ℹ 5 more variables: AIC <dbl>, BIC <dbl>, deviance <dbl>, df.residual <int>,
#   nobs <int>

The results show that the model fitted on male patients only achieves a lower AIC than the model fitted on female patients only, although the difference is not very big. The male model also gives more weight to variables related to physical health, as well as to the consumption of fruits rather than vegetables. The latter may make sense, since fruits contain more sugar than vegetables.

Something that should be highlighted is that both gender-specific models are a significant improvement over the GLM built in the first part of the analysis, lowering the AIC considerably. We did not anticipate this result, which is why it can be considered a task for future analysis.
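The AIC comparison itself can be made explicit with base R’s `AIC()`. A minimal, self-contained sketch on simulated data; in the analysis above, the comparison would be `AIC(model_women, model_men)` on the fitted gender-specific models:

```r
# Simulated stand-in data: y depends on x, so the full model should win
set.seed(22)
x <- rnorm(500)
y <- rbinom(500, 1, plogis(-1 + 1.5 * x))
sim <- data.frame(y, x)

m_full <- glm(y ~ x, data = sim, family = binomial)
m_null <- glm(y ~ 1, data = sim, family = binomial)

# Lower AIC means a better trade-off between fit and model complexity
AIC(m_full, m_null)
```

Note that a raw AIC comparison is only directly meaningful between models fitted to the same data; since the women’s and men’s datasets differ in size, the comparison here is indicative rather than formal.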

Saving Plots

plot_4 |> 
  ggsave(filename = "06_key_plot_4.png", path = "../results/")
Saving 7 x 5 in image